- WWW folks may like to comment on this, posted to wais-talk and
- cni-arch... Sorry if you've already read it there !
-
- -- Jean-Francois
-
- ------- Start of forwarded message -------
-
- From: connolly@pixel.convex.com
- To: wais-talk@Think.COM
- Cc: cni-arch@uccvma.BITNET
- Subject: Re: Document identifiers
- Date: Mon, 02 Dec 91 01:32:36 CST
-
-
- >The Coalition for Networked Information
- >Architectures & Standards Working Group
- >
- I don't like the direction this technology is headed.
-
- What is the desired functionality of these identifiers?
-
- If you want an identifier that uniquely identifies a file,
- why not use a checksum, such as returned by the unix
- sum command?
-
- Let's see how a checksum solves these issues, and then see
- what functionality I'd like to see instead.
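
As a rough illustration of the checksum-as-identifier idea, here is a minimal sketch. It assumes CRC-32 as a stand-in for the unix sum command (which uses a different, weaker algorithm), and adds a SHA-256 variant for comparison; both function names are hypothetical, and neither algorithm is proposed in the message itself.

```python
import hashlib
import zlib


def checksum_id(data: bytes) -> str:
    """Return a short ASCII identifier computed only from the document bytes."""
    # Assumption: CRC-32 stands in for the unix `sum` command.
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def digest_id(data: bytes) -> str:
    """Return a longer identifier with a much lower chance of collisions."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    doc = b"A full discussion of the BLURF protocol.\n"
    print(checksum_id(doc))  # 8 hex digits, no location information
    print(digest_id(doc))    # 64 hex digits
```

Either identifier is a plain ASCII string with no location information in it, which covers points 1 and 2 below, and recomputing it on a retrieved copy gives the integrity check asked for in point 9.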
-
- >1. The need for identifiers, as distinct from location
- >information. This is best handled by a number (much like an
- >ISSN or ISBN), but the system must accommodate multiple
- >number-assigning agencies. Thus, the identifier is proposed
- >as <numbering-authority>,<identifier> where numbering
- >authorities are registered.
- >
- There's no location info in a checksum. Done deal.
-
- >2. The pointers must be representable as an ASCII string to
- >facilitate inclusion in a wide range of material, including
- >documents and electronic mail.
- >
- Check.
-
- >3. Location information must support multiple Locations for
- >the document, including the "location of record" and one or
- >more redistribution centers, local caches, etc. The means of
- >specifying a location should be sufficiently general to span
- >at least the set of networks covered under the Internet
- >Domain Naming system (DNS).
- >
- Ah! Now we want to be able to get location info out of the
- identifier. Checksums don't help. Well, in fact, they help
- no more or less than <numbering authority>-<id> helps, unless
- a numbering authority implies a location. I'm not clear on
- this at all.
-
- >4. Objects may be retrieved by a variety of access
- >mechanisms from servers, including FTP, LISTSERV, Z39.50,
- >and perhaps FTAM and SQL-based database access, as well as
- >requests for paper copies. The location information should
- >be sufficiently general to include information about these
- >different types of access techniques, and extensible to
- >include new access methods that may develop in future.
- >
- Hmmm... now it looks like the doc id should tell how to
- get the document... but not exactly. What we're really looking
- for is some client software that interprets these numbers
- and queries servers. Checksums look as good as anything again.
-
- >5. Perhaps the location identifier should include some
- >information about the format and size of the object; on the
- >other hand, perhaps it should not. Discussion?
- >
- Checksums do not contain type/size info. If that's what we want,
- the checksum idea is no good.
-
- >6. It should be possible to further qualify a reference to a
- >"sublocation" within an object (which would have meaning
- >only to the server that houses it). This is needed, for
- >example, for hypertext-type links. Such a sublocation might
- >be the 25th paragraph of a text, for a hypertext-type
- >pointer.
- >
- Now we raise the question: just what does a document identifier
- identify? Until this item, it appeared that a document was
- a file. Now it's not so clear. Perhaps a document should be anything
- from a single character to a paragraph to a file to a chapter to
- a book to an encyclopedia to a library. That would be a good trick.
- Is that what we're after?
-
- >7. Indirection should be supported. In other words, one
- >should be able to format the location as the name of a
- >server that can be passed the identifier and which would
- >return location information. The protocol mechanism(s) for
- >doing this need to be specified as well.
- >
- Ah. Now the objectives of the location info become more clear.
- Sounds to me like the location is a TCP connection, or enough
- information on how to establish one.
-
- >8. While full rights and permissions data would seem to be
- >outside the scope of such a pointer, it might be useful to
- >include at least some basic information. This might be an
- >indication that the object is not copyrighted and can be
- >freely distributed, that it is copyrighted but can be freely
- >distributed, that it can be redistributed for noncommercial
- >use, or that restrictions apply to redistribution. Also, it
- >might make sense to include a pointer of some sort (an
- >e-mail address? a host address?) for further information
- >about rights.
- >
- Ack! This stuff seems totally orthogonal to the rest of the
- stuff, but in practice, this looks like a crucial issue.
- I don't have any good ideas here.
-
- >9. Perhaps there might be some type of checksum that can be
- >calculated on the retrieved object to ensure that the
- >pointer and the object have not gotten out of synch?
- >
- This is what sparked the checksum idea.
-
-
- My response to all this:
-
- I don't think we need [yet another] document identifier format.
- If you want location info, use an internet address; if you want
- data integrity, use a checksum; if you want format, we are lacking
- a standard here; if you want copyright info, ditto.
-
- What we need is some nifty client software to glue all the parts
- together. I guess there is some room for standardization, but please:
- LET'S LEVERAGE EXISTING SYSTEMS!
-
- Where these systems are robust, I think we should support them. I'd
- also like to see support for ad-hoc document identifiers. Here's
- an example to clarify:
-
- I'm browsing some email, netnews, or a README file from somewhere.
- I see a reference to more info:
-
- A full discussion of the BLURF protocol is available via
- anonymous FTP from frob.mit.edu as blurf-proto.tex
- in the directory /pub/protos.
-
- I select some or all of that text, and I click one of the buttons
- in my document retrieval tool:
-
- make ftp id -- extract the relevant information and display
- a well-formed identifier acceptable to some
- existing FTP client (I've heard of something
- called ange FTP. Another idea is to make
- a shell script that would do the retrieval:
- ftp frob.mit.edu
- cd /pub/protos
- get blurf-proto.tex
- )
-
- make wais id -- get enough info to make a WAIS doc ID
- [scrap this unless it stabilizes]
- make WWW id -- same thing for World Wide Web HTTP addresses.
- make NNTP id -- same thing for USENET news message id's.
- make LISTSERV id -- you get the idea
- Rather than making up a new format, these id's
- are instructions to EXISTING clients to retrieve
- a document.
-
- verify id -- connect to the necessary server(s) and verify
- that the id references an existing document.
- Append to the id a "verification date," which
- is the last time a server acknowledged the
- existence of the document.
-
- get id info -- connect to the necessary server(s) and get about
- 1K of miscellaneous info: document size in bytes,
- date of last modification, available formats,
- short summary, etc.
-
- retrieve raw -- connect and retrieve the document in whatever
- format is convenient to the server, e.g.
- a compressed tar archive of C and troff sources.
-
- retrieve text -- connect and retrieve the document as
- plain text [defined, e.g. as the body of an
- RFC-822 mail message]
-
- retrieve... -- the user or the supporting client software
- specifies the supported information formats,
- (compression schemes, archiving formats,
- image file formats, typesetting languages)
- the client and the server hash over their options,
- [perhaps with user intervention]
- and the server sends the most desirable version
- of the document it has available.
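
The "make ftp id" button described above could be sketched roughly as follows. This is only an illustration: the `FtpId` type, the function names, and the regular expression are hypothetical, and the pattern is tuned to the sample BLURF sentence rather than to the many ways people phrase such references.

```python
import re
from typing import NamedTuple, Optional


class FtpId(NamedTuple):
    """Hypothetical ad-hoc identifier: just what an FTP client needs."""
    host: str
    directory: str
    filename: str


# Assumption: references follow the wording of the sample sentence.
_FTP_REF = re.compile(
    r"FTP\s+from\s+(?P<host>[\w.-]+)\s+as\s+(?P<file>\S+)\s+"
    r"in\s+the\s+directory\s+(?P<dir>/\S*)",
    re.IGNORECASE,
)


def make_ftp_id(selection: str) -> Optional[FtpId]:
    """Extract host, directory and filename from selected text, if present."""
    match = _FTP_REF.search(selection)
    if match is None:
        return None
    # Drop sentence punctuation that trails the directory name.
    directory = match.group("dir").rstrip(".,;") or "/"
    return FtpId(match.group("host"), directory, match.group("file"))


def ftp_script(doc: FtpId) -> str:
    """Render the identifier as commands for an existing FTP client."""
    return "\n".join(
        [f"ftp {doc.host}", f"cd {doc.directory}", f"get {doc.filename}"]
    )


if __name__ == "__main__":
    text = """A full discussion of the BLURF protocol is available via
    anonymous FTP from frob.mit.edu as blurf-proto.tex
    in the directory /pub/protos."""
    found = make_ftp_id(text)
    if found is not None:
        print(ftp_script(found))
```

The point of the sketch is that the identifier is nothing more than instructions to an existing client; no new format is needed to express it.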
-
- If we add a few buttons, we begin to encompass the scope of many existing
- systems:
-
- expand -- change the doc id to reference the "document"
- containing it. In the ftp example, rather than
- "get blurf-proto.tex," it would have "ls."
- Click again and get "cd ..; ls."
- Obviously, this operation depends on the access
- mechanism. For WAIS documents, the expansion of
- a document is the source that contains it.
-
- select -- narrow the document to some of its parts. For a
- text file, select some of the characters/paragraphs.
- For a WAIS source, select some of the documents.
- For a WWW node, select a neighboring node. For
- a directory, select some files.
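
For the FTP case, "expand" amounts to moving from a file to its directory, and then to each parent directory in turn. A minimal sketch, assuming an identifier is simply a (host, path) pair (a hypothetical representation, not one defined in this message):

```python
import posixpath
from typing import Tuple


def expand(host: str, path: str) -> Tuple[str, str]:
    """Return an id for the "document" that contains the given one."""
    # FTP paths use forward slashes regardless of the local platform.
    parent = posixpath.dirname(path.rstrip("/")) or "/"
    return host, parent


if __name__ == "__main__":
    doc = ("frob.mit.edu", "/pub/protos/blurf-proto.tex")
    doc = expand(*doc)  # the directory holding the file: "ls"
    print(doc)
    doc = expand(*doc)  # its parent: "cd ..; ls"
    print(doc)
```

Other access mechanisms would need their own notion of containment, as noted above for WAIS sources.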
-
- I guess my point is, let's think about how folks are going to use this
- document referencing technology, and let's see how well existing systems
- meet these needs.
-
- I guess some groups have come to the conclusion that the existing systems
- don't cut it. I'm beginning to agree.
-
- I guess we'd all agree that we should decide how we're going to use these
- doc id's and let that drive the design of the format. i.e. Let's decide
- on the methods of this object before we decide on its representation.
-
- [an idea: for syntax, the WAIS folks chose LISP. What about using
- something akin to RFC-822 syntax? I think it works well: define a bunch
- of standard headers; require some, allow some, disregard others; allow
- free-form text in the body. examples:
-
- ISBN: 0-13-590126-X
- or
- MESSAGE-ID: usenet-thing
- or
- FTP-HOST: frob.mit.edu
- USER: anonymous
- or
- WAIS-PORT: 8001@think.com
-
-
- This would allow us to leverage all the email technology out there, plus
- the emerging multi-part mail format.
- (and it would allow me to use PERL on these beasties! :-)
- ]
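
To show how existing mail tooling could read such an identifier, here is a short sketch using the RFC-822 parser in Python's standard library (Python rather than PERL is a choice made for this illustration). The `KNOWN` and `REQUIRED` sets and the `parse_doc_id` name are hypothetical; the message does not fix which headers are standard or required.

```python
from email import message_from_string
from typing import Dict, Tuple

# Assumption: the example headers above are the "standard" ones.
KNOWN = {"ISBN", "MESSAGE-ID", "FTP-HOST", "USER", "WAIS-PORT"}
# Assumption: an FTP-style id must at least name its host.
REQUIRED = {"FTP-HOST"}


def parse_doc_id(text: str) -> Tuple[Dict[str, str], str]:
    """Split an RFC-822 style doc id into known headers and a free-form body."""
    msg = message_from_string(text)
    # Keep the known headers and disregard the rest.
    headers = {
        name.upper(): value
        for name, value in msg.items()
        if name.upper() in KNOWN
    }
    missing = REQUIRED - headers.keys()
    if missing:
        raise ValueError(f"missing required headers: {sorted(missing)}")
    return headers, msg.get_payload()


if __name__ == "__main__":
    doc_id = (
        "FTP-HOST: frob.mit.edu\n"
        "USER: anonymous\n"
        "X-Comment: disregarded\n"
        "\n"
        "Free-form notes about the document.\n"
    )
    headers, body = parse_doc_id(doc_id)
    print(headers)
    print(body)
```

Because the parser already handles header folding and the header/body split, none of that would need to be respecified for a document identifier format.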
-
- Another thing I hope folks are keeping in mind: I don't think any one
- client can meet the information-retrieval needs of everybody. We need
- to support multiple platforms, for one thing. But I hope other folks are
- considering using multiple clients at the same time! I'd like to use
- one slick X-windows front end to the whole ball of wax, in some ways like
- emacs does for programming, and in some ways like the mac GUI does for
- office-productivity applications. But I'm going to be using POST mail
- servers, NNTP servers, WAIS servers, FTP servers, etc, and I don't
- expect one client to do it all. The crucial trick is to make all this
- intuitive and interactive, i.e. to support hypertext browsing, fulltext
- retrieval, USENET news reading, and maybe email correspondence, all in
- one environment. Let's get started!
-
- Dan
-
- ------- End of forwarded message -------
-
-